title: “EDA of Trending YouTube Video Statistics”
author: “Karra Anand”
date: “18 March 2018”
output: html_document

About the dataset

YouTube (the world-famous video sharing website) maintains a list of the top trending videos on the platform. This dataset is a daily record of those videos, covering several months (and counting).
Data is included for the USA, with up to 200 listed trending videos per day.
More details about this dataset are given in About_datasat.txt, included with this project.



Preprocessing

Before we move on to plotting and analyzing the data, let us see if the data requires any cleaning.

The first 2 rows of each of the columns of the dataset are as follows:

##      video_id trending_date
## 1 2kyS6SvSYSE      17.14.11
## 2 1ZAPwfrtAFY      17.14.11
##                                                            title
## 1                             WE WANT TO TALK ABOUT OUR MARRIAGE
## 2 The Trump Presidency: Last Week Tonight with John Oliver (HBO)
##     channel_title category_id             publish_time
## 1    CaseyNeistat          22 2017-11-13T17:13:01.000Z
## 2 LastWeekTonight          24 2017-11-13T07:30:00.000Z
##                                                                                               tags
## 1                                                                                  SHANtell martin
## 2 last week tonight trump presidency|last week tonight donald trump|john oliver trump|donald trump
##     views likes dislikes comment_count
## 1  748374 57527     2966         15954
## 2 2418783 97185     6146         12703
##                                   thumbnail_link comments_disabled
## 1 https://i.ytimg.com/vi/2kyS6SvSYSE/default.jpg             False
## 2 https://i.ytimg.com/vi/1ZAPwfrtAFY/default.jpg             False
##   ratings_disabled video_error_or_removed
## 1            False                  False
## 2            False                  False
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     description
## 1 SHANTELL'S CHANNEL - https://www.youtube.com/shantellmartin\\nCANDICE - https://www.lovebilly.com\\n\\nfilmed this video in 4k on this -- http://amzn.to/2sTDnRZ\\nwith this lens -- http://amzn.to/2rUJOmD\\nbig drone - http://tinyurl.com/h4ft3oy\\nOTHER GEAR ---  http://amzn.to/2o3GLX5\\nSony CAMERA http://amzn.to/2nOBmnv\\nOLD CAMERA; http://amzn.to/2o2cQBT\\nMAIN LENS; http://amzn.to/2od5gBJ\\nBIG SONY CAMERA; http://amzn.to/2nrdJRO\\nBIG Canon CAMERA; http://tinyurl.com/jn4q4vz\\nBENDY TRIPOD THING; http://tinyurl.com/gw3ylz2\\nYOU NEED THIS FOR THE BENDY TRIPOD; http://tinyurl.com/j8mzzua\\nWIDE LENS; http://tinyurl.com/jkfcm8t\\nMORE EXPENSIVE WIDE LENS; http://tinyurl.com/zrdgtou\\nSMALL CAMERA; http://tinyurl.com/hrrzhor\\nMICROPHONE; http://tinyurl.com/zefm4jy\\nOTHER MICROPHONE; http://tinyurl.com/jxgpj86\\nOLD DRONE (cheaper but still great);http://tinyurl.com/zcfmnmd\\n\\nfollow me; on http://instagram.com/caseyneistat\\non https://www.facebook.com/cneistat\\non https://twitter.com/CaseyNeistat\\n\\namazing intro song by https://soundcloud.com/discoteeth\\n\\nad disclosure.  THIS IS NOT AN AD.  not selling or promoting anything.  but samsung did produce the Shantell Video as a 'GALAXY PROJECT' which is an initiative that enables creators like Shantell and me to make projects we might otherwise not have the opportunity to make.  hope that's clear.  if not ask in the comments and i'll answer any specifics.
## 2                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              One year after the presidential election, John Oliver discusses what we've learned so far and enlists our catheter cowboy to teach Donald Trump what he hasn't.\\n\\nConnect with Last Week Tonight online...\\n\\nSubscribe to the Last Week Tonight YouTube channel for more almost news as it almost happens: www.youtube.com/user/LastWeekTonight\\n\\nFind Last Week Tonight on Facebook like your mom would: http://Facebook.com/LastWeekTonight\\n\\nFollow us on Twitter for news about jokes and jokes about news: http://Twitter.com/LastWeekTonight\\n\\nVisit our official site for all that other stuff at once: http://www.hbo.com/lastweektonight


In many of the following data cleaning steps only the code but not the output is printed to prevent repeated printing of the same dataset with minor modifications. The final dataset obtained after the data cleaning is printed at the end of this section.


Cleaning irrelevant columns

As we are only interested in exploratory analysis of the data, we remove the columns of tags, thumbnail_link and description since they are irrelevant to us.
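
The code for this step is not shown in the report. A minimal sketch of how the three columns could be dropped is given below, assuming the data sits in a data frame named yt_trending; the toy rows and the helper name drop_cols are made up for illustration.

```r
# Toy stand-in for the trending data (the values are invented)
yt_trending <- data.frame(
  video_id       = c("a1", "b2"),
  views          = c(100, 250),
  tags           = c("x|y", "z"),
  thumbnail_link = c("https://example.com/a.jpg", "https://example.com/b.jpg"),
  description    = c("first", "second"),
  stringsAsFactors = FALSE
)

# Columns judged irrelevant to the exploratory analysis
drop_cols <- c("tags", "thumbnail_link", "description")

# Keep every column whose name is not listed in drop_cols
yt_trending <- yt_trending[, !(names(yt_trending) %in% drop_cols)]

head(yt_trending)
dim(yt_trending)
```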

##      video_id trending_date
## 1 2kyS6SvSYSE      17.14.11
## 2 1ZAPwfrtAFY      17.14.11
## 3 5qpjK5DgCt4      17.14.11
## 4 puqaWrEC7tY      17.14.11
## 5 d380meD0W0M      17.14.11
## 6 gHZ1Qz0KiKM      17.14.11
##                                                            title
## 1                             WE WANT TO TALK ABOUT OUR MARRIAGE
## 2 The Trump Presidency: Last Week Tonight with John Oliver (HBO)
## 3          Racist Superman | Rudy Mancuso, King Bach & Lele Pons
## 4                               Nickelback Lyrics: Real or Fake?
## 5                                       I Dare You: GOING BALD!?
## 6                                          2 Weeks with iPhone X
##           channel_title category_id             publish_time   views
## 1          CaseyNeistat          22 2017-11-13T17:13:01.000Z  748374
## 2       LastWeekTonight          24 2017-11-13T07:30:00.000Z 2418783
## 3          Rudy Mancuso          23 2017-11-12T19:05:24.000Z 3191434
## 4 Good Mythical Morning          24 2017-11-13T11:00:04.000Z  343168
## 5              nigahiga          24 2017-11-12T18:01:41.000Z 2095731
## 6              iJustine          28 2017-11-13T19:07:23.000Z  119180
##    likes dislikes comment_count comments_disabled ratings_disabled
## 1  57527     2966         15954             False            False
## 2  97185     6146         12703             False            False
## 3 146033     5339          8181             False            False
## 4  10172      666          2146             False            False
## 5 132235     1989         17518             False            False
## 6   9763      511          1434             False            False
##   video_error_or_removed
## 1                  False
## 2                  False
## 3                  False
## 4                  False
## 5                  False
## 6                  False
## [1] 23362    13

Converting into a data.table

From the dimensions of the dataframe in the above output, we have 23,362 rows and 13 features. This corresponds to 23,362 videos only if there are no duplicates; we shall investigate this in the following sections. Hence, we use the data.table data structure to store our data, as it is superior to a dataframe for add, remove, update, join and similar operations.

yt_trending <- data.table(yt_trending)

Adding category names

Further, the name corresponding to each category_id is available in the US_category_id.json file. Hence, we add a category_name column.

category_id <- c(1,2,10,15,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,
                  34,35,36,37,38,39,40,41,42,43,44)
category_name <- c("Film & Animation","Autos & Vehicles","Music",
                    "Pets & Animals","Sports","Short Movies",
                    "Travel & Events","Gaming","Videoblogging",
                    "People & Blogs","Comedy","Entertainment",
                    "News & Politics","Howto & Style","Education",
                    "Science & Technology",
                    "Nonprofits & Activism","Movies","Anime/Animation",
                    "Action/Adventure","Classics","Comedy",
                    "Documentary","Drama","Family","Foreign","Horror",
                    "Sci-Fi/Fantasy","Thriller","Shorts","Shows","Trailers")

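# Lookup table mapping each category_id to its name
category_id_names <- data.frame(category_id,category_name)

# Join on the shared category_id column (stated explicitly rather than inferred)
yt_trending <- merge(yt_trending,category_id_names, by = "category_id")

# Sort by views, highest first
yt_trending <- yt_trending[order(yt_trending$views,decreasing = TRUE),]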

Removing duplicate video entries

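Each video has one row for every day it trended. Because the rows are already sorted by views in decreasing order, keeping the first row for each video_id retains the entry with the highest view count. The remaining videos are then sorted in decreasing order of their number of days on trending (days_on_trending).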

yt_trending <- unique(yt_trending, by = c("video_id"))

head(yt_trending)
##       video_id category_id trending_date
## 1: sXP6vliZIHI          22      18.04.01
## 2: H0g4JxKp4fc          23      18.12.03
## 3: CwKp6Xhy3_4          10      18.12.03
## 4: kCg5D8KMqk4          26      18.07.03
## 5: E_ViwNxUldw          26      18.07.03
## 6: vQiiNGllGQo          15      18.12.03
##                                                         title
## 1: Cardi B - Bartier Cardi (feat. 21 Savage) [Official Audio]
## 2:                                                    *cough*
## 3:                                   Chris Young - Hangin' On
## 4:                                 MY EVERYDAY MAKEUP ROUTINE
## 5:                          Clear crisps / Glass Potato Chips
## 6:              Elderly man making sure his dog won't get wet
##              channel_title             publish_time    views  likes
## 1:                 Cardi B 2017-12-22T05:00:02.000Z 17540613 380464
## 2:              jacksfilms 2018-02-26T19:00:02.000Z  2292736 224986
## 3:          ChrisYoungVEVO 2018-02-26T08:00:02.000Z  1117570   7504
## 4:                 LaurDIY 2018-02-21T23:00:04.000Z  1006188  46829
## 5:       My Virgin Kitchen 2018-02-21T20:07:18.000Z   784713  12069
## 6: Rock me, Joey Santiago. 2018-02-26T11:09:32.000Z   713574  12448
##    dislikes comment_count comments_disabled ratings_disabled
## 1:    20697         29122             False            False
## 2:     8689         41467             False            False
## 3:      584           324             False            False
## 4:      710          9653             False            False
## 5:     1274          1453             False            False
## 6:      146          1474             False            False
##    video_error_or_removed  category_name days_on_trending
## 1:                  False People & Blogs               14
## 2:                  False         Comedy               14
## 3:                  False          Music               14
## 4:                  False  Howto & Style               14
## 5:                  False  Howto & Style               14
## 6:                  False Pets & Animals               14
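
The code that creates days_on_trending is not shown above. The deduplication steps can be sketched on a toy table as follows, assuming that days_on_trending is simply the number of rows recorded for each video_id (one row per trending day); the table name yt and its values are made up.

```r
library(data.table)

# Toy data: one row per video per trending day (values are invented)
yt <- data.table(
  video_id      = c("a", "a", "a", "b", "b", "c"),
  trending_date = c("17.14.11", "17.15.11", "17.16.11",
                    "17.14.11", "17.15.11", "17.14.11"),
  views         = c(100, 300, 200, 50, 80, 500)
)

# Assumption: days on trending = number of rows recorded for the video
yt[, days_on_trending := .N, by = video_id]

# Sort by views so the first row per video is its highest-view entry
yt <- yt[order(-views)]

# unique() on a data.table keeps the first row for each video_id
yt <- unique(yt, by = "video_id")

# List the longest-trending videos first, breaking ties by views
yt <- yt[order(-days_on_trending, -views)]
print(yt)
```

Counting before deduplicating matters here: once unique() has run, the per-day rows are gone and the count can no longer be recovered.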


Univariate Plots Section

## [1] 4712   15
##  [1] "video_id"               "category_id"           
##  [3] "trending_date"          "title"                 
##  [5] "channel_title"          "publish_time"          
##  [7] "views"                  "likes"                 
##  [9] "dislikes"               "comment_count"         
## [11] "comments_disabled"      "ratings_disabled"      
## [13] "video_error_or_removed" "category_name"         
## [15] "days_on_trending"
##         video_id     category_id     trending_date 
##  00nmxR1mxIA:   1   Min.   : 1.00   18.12.03: 199  
##  00RpZZThSAs:   1   1st Qu.:17.00   18.09.01: 141  
##  01AEuxSlIMg:   1   Median :24.00   18.01.02:  84  
##  02e9klKUN0Y:   1   Mean   :20.44   17.13.12:  70  
##  02N508BDngc:   1   3rd Qu.:25.00   17.14.11:  69  
##  032BPsxhreM:   1   Max.   :43.00   17.22.11:  68  
##  (Other)    :4706                   (Other) :4081  
##                                                                                           title     
##  DORITOS BLAZE vs. MTN DEW ICE | Super Bowl Commercial with Peter Dinklage and Morgan Freeman:   2  
##  Justice League - Movie Review                                                               :   2  
##  Maroon 5 - Wait                                                                             :   2  
##  Missouri Star Quilt Company Live Stream                                                     :   2  
##  NBA Bloopers - The Starters                                                                 :   2  
##  Selena Gomez, Marshmello - Wolves                                                           :   2  
##  (Other)                                                                                     :4700  
##                                 channel_title 
##  The Tonight Show Starring Jimmy Fallon:  51  
##  ESPN                                  :  46  
##  TheEllenShow                          :  44  
##  Jimmy Kimmel Live                     :  42  
##  Netflix                               :  42  
##  The Late Show with Stephen Colbert    :  41  
##  (Other)                               :4446  
##                    publish_time      views               likes        
##  2017-11-17T05:00:00.000Z:   4   Min.   :      559   Min.   :      0  
##  2017-11-17T05:00:01.000Z:   3   1st Qu.:    95075   1st Qu.:   1600  
##  2017-12-13T15:00:01.000Z:   3   Median :   331606   Median :   7726  
##  2018-01-12T05:00:01.000Z:   3   Mean   :  1277663   Mean   :  39715  
##  2018-02-16T14:00:03.000Z:   3   3rd Qu.:  1025326   3rd Qu.:  25876  
##  2017-11-10T05:00:01.000Z:   2   Max.   :149376127   Max.   :3093544  
##  (Other)                 :4694                                        
##     dislikes       comment_count       comments_disabled ratings_disabled
##  Min.   :      0   Min.   :      0.0   False:4633        False:4686      
##  1st Qu.:     79   1st Qu.:    238.0   True :  79        True :  26      
##  Median :    302   Median :    888.5                                     
##  Mean   :   2598   Mean   :   4975.6                                     
##  3rd Qu.:   1058   3rd Qu.:   2914.2                                     
##  Max.   :1674420   Max.   :1361580.0                                     
##                                                                          
##  video_error_or_removed         category_name  days_on_trending
##  False:4711             Entertainment  :1141   Min.   : 1.000  
##  True :   1             Music          : 585   1st Qu.: 3.000  
##                         News & Politics: 438   Median : 5.000  
##                         Howto & Style  : 436   Mean   : 4.958  
##                         Comedy         : 390   3rd Qu.: 7.000  
##                         People & Blogs : 368   Max.   :14.000  
##                         (Other)        :1354

As we can see from the above output, we have the data of 4,712 videos and their 15 features.

##      Action/Adventure       Anime/Animation      Autos & Vehicles 
##                     0                     0                    66 
##              Classics                Comedy           Documentary 
##                     0                   390                     0 
##                 Drama             Education         Entertainment 
##                     0                   186                  1141 
##                Family      Film & Animation               Foreign 
##                     0                   237                     0 
##                Gaming                Horror         Howto & Style 
##                    57                     0                   436 
##                Movies                 Music       News & Politics 
##                     0                   585                   438 
## Nonprofits & Activism        People & Blogs        Pets & Animals 
##                    13                   368                   116 
##  Science & Technology        Sci-Fi/Fantasy          Short Movies 
##                   308                     0                     0 
##                Shorts                 Shows                Sports 
##                     0                     2                   320 
##              Thriller              Trailers       Travel & Events 
##                     0                     0                    49 
##         Videoblogging 
##                     0

Plotting the number of videos by category name shows a large variation in the number of videos per category. Entertainment and Music are the top 2 categories with 1,141 and 585 videos respectively, followed by News & Politics with 438 videos and Howto & Style with 436. We can also see their respective shares of the total.

Also, some categories like Action/Adventure, Anime/Animation etc. have no videos.
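
As a rough check on the ranking and shares discussed above, the non-zero counts from the table can be re-entered by hand and sorted. This is only a sketch; on the real data the counts would presumably come from something like table(yt_trending$category_name), and the name category_counts is made up.

```r
# Counts copied from the table above (zero-count categories omitted)
category_counts <- c(
  "Entertainment" = 1141, "Music" = 585, "News & Politics" = 438,
  "Howto & Style" = 436, "Comedy" = 390, "People & Blogs" = 368,
  "Sports" = 320, "Science & Technology" = 308, "Film & Animation" = 237,
  "Education" = 186, "Pets & Animals" = 116, "Autos & Vehicles" = 66,
  "Gaming" = 57, "Travel & Events" = 49, "Nonprofits & Activism" = 13,
  "Shows" = 2
)

# Rank categories from most to fewest videos
ranked <- sort(category_counts, decreasing = TRUE)

# Share of each category in the total, as a percentage
shares <- round(100 * ranked / sum(ranked), 1)

print(head(ranked, 4))
print(head(shares, 4))
```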


## [1] "Summary of views feature:"
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##       559     95070    331600   1278000   1025000 149400000
## [1] "Summary of likes feature:"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    1600    7726   39720   25880 3094000
## [1] "Summary of dislikes feature:"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0      79     302    2598    1058 1674000
## [1] "Summary of comments feature:"
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##       0.0     238.0     888.5    4976.0    2914.0 1362000.0

This plot shows the log10 of the number of views, likes, dislikes and comments on the videos.

On the log10 scale, each distribution looks similar to a normal distribution. Further, the mean number of views is greater than that of any other attribute.

The mean, median and maximum number of views are 1,278,000, 331,600 and 149,400,000 respectively.

The mean, median and maximum number of likes are 39,720, 7,726 and 3,094,000 respectively.

The mean, median and maximum number of dislikes are 2,598, 302 and 1,674,000 respectively.

The mean, median and maximum number of comments are 4,976, 888.5 and 1,362,000 respectively.
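
One detail worth keeping in mind with the log10 transform is that likes, dislikes and comments all have a minimum of 0, and log10(0) is -Inf. The sketch below uses the quartile values of likes from the summary above; shifting by 1 before taking the logarithm is a common workaround and is an assumption here, not necessarily what was done for the plot.

```r
# Min, quartiles and max of likes, copied from the summary output above
likes <- c(0, 1600, 7726, 25876, 3093544)

# log10(0) is -Inf, so zero-like videos fall off a log-scaled axis
raw_log <- log10(likes)
print(raw_log)

# Assumed workaround: add 1 before taking logs so that 0 maps to 0
log_likes <- log10(likes + 1)
print(round(log_likes, 2))
```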


## [1] "Summary of videos with comments disabled"
## False  True 
##  4633    79
## [1] "Summary of videos with ratings disabled"
## False  True 
##  4686    26

So we have only 79 videos which have their comments disabled and only 26 videos with their ratings disabled.


##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   3.000   5.000   4.958   7.000  14.000

We can observe a bimodal distribution for the days on trending, with videos trending for an average of about 5 days and a maximum of 14 days.


Univariate Analysis

What is the structure of your dataset?

After preprocessing to remove duplicates, the dataset contains data for 4,712 videos and 15 features (video_id, category_id, trending_date, title, channel_title, publish_time, views, likes, dislikes, comment_count, comments_disabled, ratings_disabled, video_error_or_removed, category_name, days_on_trending).

Unordered factors: category_id, category_name.
category_id: 1,2,10,15,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33, 34,35,36,37,38,39,40,41,42,43,44
category_name: “Film & Animation”,“Autos & Vehicles”,“Music”, “Pets & Animals”,“Sports”,“Short Movies”, “Travel & Events”,“Gaming”,“Videoblogging”, “People & Blogs”,“Comedy”,“Entertainment”, “News & Politics”,“Howto & Style”,“Education”, “Science & Technology”, “Nonprofits & Activism”,“Movies”,“Anime/Animation”, “Action/Adventure”,“Classics”,“Comedy”, “Documentary”,“Drama”,“Family”,“Foreign”,“Horror”, “Sci-Fi/Fantasy”,“Thriller”,“Shorts”,“Shows”,“Trailers”

Other observations:

  • Entertainment is the category with the highest number of videos on trending, with a total of 1,141 videos.
  • The average number of views on a trending video is about 1,278,000.
  • The maximum number of days for which a video stayed on trending is 14.
  • The average number of views on a video in the top 100 trending videos is about 3,411,000.
  • Most of the videos in the top 100 trending videos stayed on trending for 13 days.

What is/are the main feature(s) of interest in your dataset?

The number of views (views), number of days the video is trending (days_on_trending) and category of the video (category_name) are the main features of interest here.


What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

Although subject to the viewers’ biases, number of likes, dislikes and comments can also help in understanding the video’s position on the list of trending videos.


Did you create any new variables from existing variables in the dataset?

Yes, the variable category_name was created to better interpret the category_id feature, which has a direct correspondence with category_name.
Also, the days_on_trending variable was created to keep track of the number of days the video was on trending.


Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

One unusual observation was that there were no videos from some of the categories, such as Action/Adventure and Anime/Animation.

Some features like tags, thumbnail_link and description were removed from the dataset as they took a lot of space when printed and were irrelevant to our analysis.

A log10 transform was applied in multiple plots to convey the scale and variation in the data as required.

Also, a video that was trending on multiple days was reduced to a single entry, for the day on which it had the highest views.



Bivariate Plots Section

The above correlation matrix helps in identifying some of the interesting trends in the data.

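We have a high correlation between the number of views and the number of likes, and between the number of dislikes and the comment count.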
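
The code behind the correlation matrix is not shown. A matrix of this kind could be computed along the following lines, assuming Pearson correlation (the default of cor()) over the four count features; the toy table yt and its values are made up.

```r
library(data.table)

# Toy data standing in for the four count features (values are invented)
yt <- data.table(
  views         = c(10, 20, 30, 40, 50),
  likes         = c(1, 2, 3, 4, 6),
  dislikes      = c(5, 3, 4, 1, 2),
  comment_count = c(2, 1, 3, 1, 2)
)

num_cols <- c("views", "likes", "dislikes", "comment_count")

# Pairwise Pearson correlations between the selected columns
cor_matrix <- cor(yt[, num_cols, with = FALSE])

print(round(cor_matrix, 2))
```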


But, before we plot scatter plots to visualize these correlations, we have to normalize the data ranges of the above mentioned four features.


After normalizing each feature to the range [0,1], we get the following output:

##          views       likes     dislikes comment_count
## 1: 0.117422509 0.122986452 1.236070e-02  0.0213883870
## 2: 0.015345060 0.072727590 5.189260e-03  0.0304550596
## 3: 0.007477869 0.002425697 3.487775e-04  0.0002379588
## 4: 0.006732219 0.015137654 4.240274e-04  0.0070895577
## 5: 0.005249547 0.003901351 7.608605e-04  0.0010671426
## 6: 0.004773304 0.004023864 8.719437e-05  0.0010825658
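
The normalization code is not shown, but the values above appear consistent with min-max scaling. A minimal sketch follows, with min_max as a hypothetical helper name; it is applied to the minimum, the first-row value and the maximum of views taken from the earlier outputs.

```r
# Hypothetical helper: rescale a numeric vector to the range [0, 1]
min_max <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

# Minimum, first-row value and maximum of views from the outputs above
views <- c(559, 17540613, 149376127)

scaled <- min_max(views)
print(scaled)

# On a data.table, the same helper could be applied to several columns:
# cols <- c("views", "likes", "dislikes", "comment_count")
# yt_trending[, (cols) := lapply(.SD, min_max), .SDcols = cols]
```

Min-max scaling preserves the shape of each distribution, so a single extreme maximum still compresses most values close to 0, as the printed rows suggest.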

Using suitable limits for the X and Y axis:

We can clearly observe the correlation that we found out previously using the correlation matrix.

We can see the variation in the features of likes, dislikes, comment count and days on trending in the following plots.

The moderate correlations, such as those between views and dislikes and between views and comment count, are also visible here.

As the number of views increases, the other features also increase, which is in agreement with our calculated statistics in the univariate plots section.

Moreover, only a small number of videos trend for 10 days or more.

The categories of Entertainment and Music have very high values for all the features across the board.

Nonprofits & Activism has the lowest median and 1st quartile values for nearly all the features.

The Shows category has a very small difference between the 3rd and 1st quartiles for all the features (note that it contains only 2 videos). It also has the highest median, 1st and 3rd quartile values for days on trending.
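
The per-category quartiles read off the plots can also be tabulated directly. The sketch below uses a toy data.table (names and values made up) and includes the group size, since a very small group such as Shows will naturally have a narrow interquartile range.

```r
library(data.table)

# Toy data: two categories of very different size (values are invented)
yt <- data.table(
  category_name = c("Music", "Music", "Music", "Music", "Shows", "Shows"),
  views         = c(100, 200, 300, 400, 50, 60)
)

# Group size, quartiles and median of views for each category
by_category <- yt[, .(
  n      = .N,
  q1     = quantile(views, 0.25, names = FALSE),
  median = as.numeric(median(views)),
  q3     = quantile(views, 0.75, names = FALSE)
), by = category_name]

# Interquartile range: the height of the box in a box plot
by_category[, iqr := q3 - q1]

print(by_category)
```

The as.numeric() wrapper guards against median() returning an integer for some groups and a double for others, which data.table rejects when combining group results.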


Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

We have a high correlation between:

  • the number of views and number of likes, and
  • the number of dislikes and comment count

in the complete dataset, but not in the Top 100 Trending Videos.

Nonprofits & Activism has the lowest median and 1st quartile values for nearly all the features.

The Shows category has a very small difference between the 3rd and 1st quartiles for all the features. It also has the highest median, 1st and 3rd quartile values for days on trending.

In the Top 100 Trending Videos, the trend of Entertainment and Music having high values across all the features still continues but is a lot less pronounced, with other categories like Film & Animation and People & Blogs coming close.

Pets & Animals is, surprisingly, the category with the highest median, 1st and 3rd quartile values in the Top 100 Trending Videos.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

In the Top 100 Trending Videos subset of the dataset, the Pets & Animals category has the highest median, 1st and 3rd quartile values.

What was the strongest relationship you found?

The strongest correlation was between views and likes, with a coefficient of 0.83 in the complete dataset and 0.85 in the Top 100 Trending Videos.